Investigating characteristics of movies using IMDB dataset.

Apr 27, 2018

Table of Contents

Introduction

After briefly going through the IMDB movie dataset, one can start to notice some correlations or trends between various characterstics of the movie. The pertinant business question that any Data Analyst would ask when browsing through this data set is to find out what characterstics of movies produce the highest revenue. This investigation and reaserch would help the film industry to work on those characterstics that would generate the highest revenue and by extension increase the movies popularity

It is due to these above mentioned reasons, that I have decided to keep one major overaching question that would drive the entire investigation and from which other sub-questions and trends would be discovered.

The focus of our investigation would revolve around

"What characterstics and traits of a movie generate higher revenue". Is their a positive correlation between one characterstic and revenue and can one conclude that this correlation might also be defined as causation?

From this other subquestions can be derived and various other trends explored for example:

1. Does the budget have a direct correlation with the revenue produce. Our normal understanding would be that the higher the investment, the greater the revenue that shall be generated. We shall use the data to either support or disprove this statement.

2. Do any other characterstics support any other co-relation either positive or negative for high revenue? Our further exploratory analysis would lead us to ask more questions.
The other significatiant investigation we shall be doing is understanding trends within the dataset independent of one another to come to conclusions. Whether certain trends show a decline in a particular characterstic or help increase it.

In [106]:
# Importing the important libraries that we shall use in this investigation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
% matplotlib inline
import matplotlib
sns.set(style="ticks", color_codes=True)
UsageError: Line magic function `%` not found.

Data Wrangling

General Observations

  1. Too many null(or '0' values in the budget_adj and revenue_adj coloumn. We cannot take weighted average as almost half of the data is missing so shall lead to inaccurate results.
  2. release_date, revenue_adj, budget_adj not in the correct datatype.
  3. Results show one duplicated values.
  4. Corrupt values found in 'directors' coloumn.
  5. Cast, Genre, production_companies contain numerous values per cell delimite by '|'. We might need to work on a mechanism to extract them
In [107]:
#Load dataset 
# Original Dataset from https://www.kaggle.com/tmdb/tmdb-movie-metadata
df = pd.read_csv('tmdb-movies.csv')
In [109]:
#Data wranging. Lets investigate the data. 

# df.describe()
df.info();
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              10866 non-null float64
revenue_adj             10866 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB
In [118]:
#I saw some zeros when browsing the data. Let me see how many they are randomly
# df['budget_adj'].head(80)
df['revenue'].head(60)
# df.head(4)
Out[118]:
0     1513528810
1      378436354
2      295238201
3     2068178225
4     1506249360
5      532950503
6      440603537
7      595380321
8     1156730962
9      853708609
10     880674609
11     183987723
12      36869414
13     243637091
14    1405035767
15     155760117
16     325771424
17     518602163
18     542351353
19     650523427
20     209035668
21      91709827
22     470490832
23     569651467
24     133346506
25     682330139
26     215863606
27     403802136
28      88346473
29     311256926
30      29355203
31     102069268
32     287506194
33     162610473
34     150170815
35      35401758
36      22354572
37     331926147
38      71561644
39      62076141
40     201634991
41     140396650
42      34441873
43     108145109
44      42629776
45      14333790
46     368871007
47      85512300
48             0
49       9064511
50      40272135
51     194564672
52     203427584
53     101134059
54     133718711
55      36606743
56      30523226
57      30418560
58      64191523
59      17752940
Name: revenue, dtype: int64
In [122]:
# Below shows significant amount of missing values. 
print('Count of missing values(zeros) in budget_adj colm')
df[df['budget_adj'] == 0].count()
Count of missing values(zeros) in budget_adj colm
Out[122]:
id                      5696
imdb_id                 5687
popularity              5696
budget                  5696
revenue                 5696
original_title          5696
cast                    5631
homepage                1229
director                5656
tagline                 3488
keywords                4534
overview                5692
runtime                 5696
genres                  5674
production_companies    4802
release_date            5696
vote_count              5696
vote_average            5696
release_year            5696
budget_adj              5696
revenue_adj             5696
dtype: int64
In [125]:
# Below shows significant amount of missing values. 
print('Count of missing values(zeros) in revenue_adj colm')
df[df['revenue_adj'] == 0].count()
Count of missing values(zeros) in revenue_adj colm
Out[125]:
id                      6016
imdb_id                 6006
popularity              6016
budget                  6016
revenue                 6016
original_title          6016
cast                    5945
homepage                1271
director                5973
tagline                 3656
keywords                4758
overview                6012
runtime                 6016
genres                  5993
production_companies    5082
release_date            6016
vote_count              6016
vote_average            6016
release_year            6016
budget_adj              6016
revenue_adj             6016
dtype: int64
In [127]:
#Search for missing values
df.duplicated().sum() 
# only one duplicate value
Out[127]:
1

Data Cleaning

We'll clean and extract the relavent data over here. However, since priminarly investigations show that certain coloums(e.g revenue, budget) have a high amount of null(or '0'/Zero values). Removing them in the earlier stage would significantly reduce the size of the dataset therefore making whatever conclusions we draw less credible. We shall therfore decide to remove them(missing values) accordingly to the needs of the questions/exploratory analysis we do.

1. Removing the following coloumns:

a. Removing budget and revenue coloumns since 'revenue_adj' and 'budget_adj' is more appropriate since they take into consideration inflation.
b. Dropping the following coloumns since we do not find them relavent to the dataset. We shall keep the 'id' coloumn as a unique identifier for the row. [imdb_id, homepage, tagline, overview]

  1. Changing the type of the following coloumn.
    - a. release date to data type

  2. Removing duplicate (NaN) rows from the dataframe

Note that we shall further clean the dataset when we further investigate the dataset. Cleaning or removing values right now will cause a significant decrease in data

In [129]:
df.head(2)
Out[129]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08

2 rows × 21 columns

In [130]:
# After discussing the structure of the data and any problems that need to be cleaned, 
# perform those cleaning steps in the second part of this section.
# Removing budget and renvuue since revenue_adj and budget_adj is more appropriate since it considers inflation 
df.drop(['imdb_id','budget', 'revenue', 'homepage'], axis=1, inplace=True)
In [138]:
#renaming to more descriptive coloumns.
df.rename(index=str, columns={"budget_adj": "total_budget", "revenue_adj": "total_revenue"}, inplace=True)
#changing type of date
df['release_date'] = pd.to_datetime(df['release_date'])
df.head(2)
Out[138]:
id popularity original_title cast director tagline keywords overview runtime genres production_companies release_date vote_count vote_average release_year total_budget total_revenue
0 135397 32.985763 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow The park is open. monster|dna|tyrannosaurus rex|velociraptor|island Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 2015-06-09 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 28.419936 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... George Miller What a Lovely Day. future|chase|post-apocalyptic|dystopia|australia An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 2015-05-13 6185 7.1 2015 1.379999e+08 3.481613e+08
In [139]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10865 entries, 0 to 10865
Data columns (total 17 columns):
id                      10865 non-null int64
popularity              10865 non-null float64
original_title          10865 non-null object
cast                    10789 non-null object
director                10821 non-null object
tagline                 8041 non-null object
keywords                9372 non-null object
overview                10861 non-null object
runtime                 10865 non-null int64
genres                  10842 non-null object
production_companies    9835 non-null object
release_date            10865 non-null datetime64[ns]
vote_count              10865 non-null int64
vote_average            10865 non-null float64
release_year            10865 non-null int64
total_budget            10865 non-null float64
total_revenue           10865 non-null float64
dtypes: datetime64[ns](1), float64(4), int64(4), object(8)
memory usage: 1.5+ MB
In [140]:
#removing duplicate values
x = df.duplicated().sum()
print("No of Duplicate Values", x)
df.drop_duplicates(inplace=True)
y = df.duplicated().sum()
print("No of Duplicate Values after cleaning", y)
No of Duplicate Values 0
No of Duplicate Values after cleaning 0

Exploratory Data Analysis

Research Question 1: Are there any features/properties in the data which are correlated to or effect revenue?

First make a copy of the original cleaned dataframe and perform changes with regards to this question only on this dataframe so that the original remains unaffected.
Cleaning data - and removing all those values which are empty in both total_revenue and total_budget. The removal of values reduces the dataset to 50% therefore one can question the crediblity of the analysis.

In [142]:
# 1 Copy the dataframe so that the main dataframe remains unaffected
df_q2 = df.copy(deep=True)
In [143]:
# Cleaning data - and removing all those values which are empty in both revenue and budget. 
# The removal of values reduces the dataset to 50% therefore one can question the crediblity of the analysis. 
# 2 remove all null values
df_q2 = df_q2[df_q2['total_revenue'] != 0]
df_q2 = df_q2[df_q2['total_budget'] != 0]
#check the data
df_q2.head(2)
Out[143]:
id popularity original_title cast director tagline keywords overview runtime genres production_companies release_date vote_count vote_average release_year total_budget total_revenue
0 135397 32.985763 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow The park is open. monster|dna|tyrannosaurus rex|velociraptor|island Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 2015-06-09 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 28.419936 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... George Miller What a Lovely Day. future|chase|post-apocalyptic|dystopia|australia An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 2015-05-13 6185 7.1 2015 1.379999e+08 3.481613e+08
In [150]:
# 3 a Plot of budget and revenue
sns.pointplot(x="total_budget", y="total_revenue", data=df_q2);
# sns.pointplot(x="total_revenue", y="total_budget", data=df_q2);
In [193]:
# 3 b Scatter plot of budget and revenue, shows better correlation
sns.regplot(x="total_budget", y="total_revenue", data=df_q2, scatter_kws={'alpha':0.3});

Plot the graph between budget and revenue. The graph/scatter plot shows that both [1] the budget and revenue are positivly correlated . As the budget increases the average revenue of a movie also increases.

Q1 - b) Similary lets see if there is any relationship between movie duration and revenue?

In [194]:
# 3 c  Graph between populaity and revenue shows that there isn't any correlation betweeen these two values
# matplotlib.pyplot.scatter(df_q2['popularity'] ,df_q2['total_revenue']);
# matplotlib.pyplot.xlabel("Popularity")
# matplotlib.pyplot.ylabel("Revenue");
sns.pointplot(x="popularity", y="total_revenue", data=df_q2);

Lets remove the outliers present in our data.

In [192]:
# 4 clear outliers
df_q2_removeOutliers = df_q2[df_q2['popularity'] < 15]
In [186]:
#matplotlib.pyplot.scatter(df_q2_removeOutliers['popularity'] ,df_q2_removeOutliers['total_revenue']);
#matplotlib.pyplot.xlabel("Popularity")
#matplotlib.pyplot.ylabel("Revenue");
sns.pointplot(x="popularity", y="total_revenue", data=df_q2_removeOutliers);

The above graph shows that there is a significant relationship between popularity and revenue The more popularity a particular movie has the more revenue it usually tends to produce.

Q1 - c) Reputed directors traditionally do have a impact on the movie revenue. Lets first find out the directors which directed the most movies and then compare the revenues made by the movies they directed. We will find out whether a particular director directing a movie has any(if any) significant impact on the revenues.

In [187]:
# 5 Drop all null values. 
df_q2 = df_q2.dropna()
# Sieve through the dataset by using a group by function and find out the list of directors which produced the most
# movies in descending order. Since the number of directors in this dataset is huge, we shall only consider the top few who 
#have directed a greater number of movies. 
top_actors = (df_q2.groupby('director')['original_title'].count()).sort_values(ascending=False)
In [50]:
#List of top directors aka top_actors
top_actors[:30].plot(kind='bar', figsize=(15,11), title='No. of movies directed by top 30 directors').yaxis.label.set_visible(False);

One can observe from the above graph that Steven Spielberg and Ridley Scott and Client Eastwood have directed a marked greater number of movies than the rest.


The below code first sieves through dataframe and extracts all those rows which contains the names of the top 21 directors. The for loop extracts the row from the dataframe and keeps on appended it to the df_topDirectors dataframe.

In [188]:
# 6 Now create a new dataframe which only contain the rows belonging to the movies produced by these directors
df_temp = pd.DataFrame()
for actor,v in top_actors[:21].items():
    df_temp =df_temp.append(df_q2[df_q2['director'].str.contains(actor)],ignore_index=True)
df_topDirectors = df_temp
  • Once the dataframe which contains the movies directed by the top directors is produced. The 'total_revenue' column which is of object type is converted to float so that the groupby mean clause can be executed on this coloumn.
  • Having done that the grouped data is then stored in the df_topDirectorsRevenue dataframe and the non-essential coloumns which are not required for plotting are removed.
  • Sorting of the revenue is done so that the data is visually clear and appealing.
In [189]:
# 7 a convert to float datatype
df_topDirectors['total_revenue'] = df_topDirectors['total_revenue'].astype(str).astype('float64')
# 7 b take the mean of the revenue produced by all the movies directed by the director. 
df_topDirectorsRevenue = df_topDirectors.groupby(['director']).mean()
# 7 c Extract the relavent coloumns needed for plotting
df_topDirectorsRevenue =df_topDirectorsRevenue.drop(df_topDirectorsRevenue.columns.difference(['director','total_revenue']), 1)
# 7 d Sort the total_revenue coloumn in descending order
df_topDirectorsRevenue = df_topDirectorsRevenue.sort_values(by='total_revenue', ascending=1)
In [190]:
df_topDirectorsRevenue.plot(kind='barh', figsize=(10, 10), sort_columns='True', title='Average Revenue of Movies per directors' );
In [55]:
#one might want to fiddle with the data and use a pie chart
df_topDirectorsRevenue[:3].plot(kind='pie', subplots=True, figsize=(10, 10), sort_columns='True', title='Average Revenue of Movies per directors', legend=False );

The bar chart above shows that directors who have directed a total of 18 movies. We can infer that the movies directed by the topmost directors from Peter Jackson to Tim Burton tend to have a greater revenue. Lets further investigate if this inference hold any crediblity by looking into the details of the top director - Peter Jackson

In [195]:
# 8 a Extract rows containing movies directed by Peter Jackson
Peter_Jackson=df_topDirectors[df_topDirectors['director'].str.contains('Peter Jackson')].sort_values(by='total_revenue', ascending=0)
In [196]:
Peter_Jackson[['id', 'director', 'original_title' ,'total_revenue']]
Out[196]:
id director original_title total_revenue
295 122 Peter Jackson The Lord of the Rings: The Return of the King 1.326278e+09
292 121 Peter Jackson The Lord of the Rings: The Two Towers 1.122902e+09
291 120 Peter Jackson The Lord of the Rings: The Fellowship of the Ring 1.073080e+09
294 49051 Peter Jackson The Hobbit: An Unexpected Journey 9.658933e+08
296 57158 Peter Jackson The Hobbit: The Desolation of Smaug 8.970944e+08
289 122917 Peter Jackson The Hobbit: The Battle of the Five Armies 8.797523e+08
297 254 Peter Jackson King Kong 6.141098e+08
290 7980 Peter Jackson The Lovely Bones 9.505945e+07
299 10779 Peter Jackson The Frighteners 4.081567e+07
293 1024 Peter Jackson Heavenly Creatures 4.485618e+06
298 763 Peter Jackson Braindead 3.770420e+05

Lets have a look at the data in graphical format aswell

In [197]:
Peter_Jackson[['total_revenue']].plot(kind='bar', sort_columns='True', title='Movie revenue directed by Peter Jackson');

Lets also have a look at the movies directed by Stevin Speilburg before making some conclusions

In [198]:
# 8 b Extract rows containing movies directed by Steven Spielburg
StevenSp=df_topDirectors[df_topDirectors['director'].str.contains('Steven Spielberg')].sort_values(by='total_revenue', ascending=0)
In [199]:
StevenSp[['id', 'director', 'original_title' ,'total_revenue']]
Out[199]:
id director original_title total_revenue
24 578 Steven Spielberg Jaws 1.907006e+09
18 601 Steven Spielberg E.T. the Extra-Terrestrial 1.791694e+09
26 329 Steven Spielberg Jurassic Park 1.388863e+09
1 840 Steven Spielberg Close Encounters of the Third Kind 1.092965e+09
17 85 Steven Spielberg Raiders of the Lost Ark 9.351169e+08
20 89 Steven Spielberg Indiana Jones and the Last Crusade 8.340611e+08
3 217 Steven Spielberg Indiana Jones and the Kingdom of the Crystal S... 7.966945e+08
15 87 Steven Spielberg Indiana Jones and the Temple of Doom 6.989810e+08
11 74 Steven Spielberg War of the Worlds 6.607145e+08
19 857 Steven Spielberg Saving Private Ryan 6.445564e+08
25 424 Steven Spielberg Schindler's List 4.849410e+08
22 879 Steven Spielberg Hook 4.816965e+08
7 180 Steven Spielberg Minority Report 4.344417e+08
6 640 Steven Spielberg Catch Me If You Can 4.268546e+08
4 17578 Steven Spielberg The Adventures of Tintin 3.605580e+08
9 330 Steven Spielberg The Lost World: Jurassic Park 3.111887e+08
10 873 Steven Spielberg The Color Purple 2.965582e+08
2 644 Steven Spielberg A.I. Artificial Intelligence 2.905407e+08
13 594 Steven Spielberg The Terminal 2.533052e+08
5 57212 Steven Spielberg War Horse 1.721504e+08
0 296098 Steven Spielberg Bridge of Spies 1.496016e+08
12 612 Steven Spielberg Munich 1.455540e+08
21 11352 Steven Spielberg Always 1.304020e+08
8 11831 Steven Spielberg Amistad 1.005261e+08
14 11519 Steven Spielberg 1941 9.541184e+07
16 15301 John Landis|Steven Spielberg|George Miller|Joe... Twilight Zone: The Movie 6.447725e+07
23 5121 Steven Spielberg The Sugarland Express 5.660523e+07

Lets have a look at the data in graphical format aswell

In [200]:
StevenSp[['total_revenue']].plot(kind='bar', sort_columns='True', title='Average Revenue of Movies directed by Steven Speilburg');

Having evaluated our findings from the above dataset(tables and graphs), there isn't substancial evidence to suggest that the movies produced by such directors would generate high revenue as these table show that while some movies did produced a marked high revenue well above the global mean revenue, there were significant other movies that did not perform as well.

Next we shall see the trends in revenue over the years.

In [201]:
sns.set(rc={'figure.figsize':(26.7,8.27)})
sns.barplot(x="release_year", y="total_revenue", data=df_q2);
In [203]:
df_q2['total_revenue'].mean()
Out[203]:
148448631.16558233

The above graph shows that the revenue(with fluctuations) was higher from the 60's to the 70's while it remained almost the same and consistant with minor fluctuations after the 90's However, it must also be noted that the during the data cleaning process alot of missing data was from 1960s to 1980, and when cleaned only few data remained which contained the revenue of the movies. This does make one consider the crediblity of this result.

So considering our inital question that 'Are there any features/properties in the data which are correlated to or effect revenue?'
We find that

  • There is a positive correlation between budget and revenue. Movies with a higher budget tend to generate a higher revenue. However considering the fact that we lost 50% of our dataset due to null values, one might question the crediblity of this inference.
  • There is also a positive correlation between popularity and revenue.
  • There isn't substancial evidence to suggest that the movies produced by popular directors would generate high revenue

    However we did discover that Steven Spielberg and Peter Jackson have directed a marked greater number of movies than the rest.

  • Between the early 60's to 70's, we see a fluctuation with the movie revenue, averaging with a marked greated revenue than seen after the 80s. As the years have progressed into the 80's, the revenue generated by movies have remained steady. The limitation of this observation is the fact that there is a few data avaliable for the years of 1960s to 70s hence the reliablity of the observation can be questioned.

We can also see our conclusion summerized in the below correlation matrix created

In [164]:
revenue_corr = df_q2.copy(deep=True)
# 9 a create a correlation matrix
revenue_corr=df_q2.copy(deep=True)
revenue_corr['total_revenue'] = revenue_corr['total_revenue'].astype(str).astype('float64')
revenue_corr['total_budget'] = revenue_corr['total_budget'].astype(str).astype('float64')
revenue_corr.drop(columns=['id'], inplace=True)
revenue_corr =revenue_corr.corr()
In [165]:
# 9 b Plot the matrix using following configerations

plt.figure(figsize=(13,13))
revenue_corr_1=plt.matshow(revenue_corr,fignum=4)
plt.xticks(range(len(revenue_corr.columns)),revenue_corr.columns)
plt.yticks(range(len(revenue_corr.columns)),revenue_corr.columns)
plt.colorbar(revenue_corr_1,fraction=0.02)
plt.show();
<Figure size 936x936 with 0 Axes>

High Postive correlation between

  • Popularity and vote count
  • Popularity and total revenue
  • Vote count and total revenue

Low correlation between

  • Release year and vote average
  • Runtime and release year
  • Release year and revenue

Question 2 - Exploring trends.

Now having explored the relation between revenue and various other features, We shall explore more and see whether there are other trends present in the data.

Q2 - a) Which genres are the have the most popular genres over the years? How can I see these trends in the movies.
I have decided to use pie charts to describe the trends of the movies over the years.

In [210]:
# 10 a create new dataframe for this question.
df_q1 = df.copy(deep=True)

Cleaning the dataset according to requirments and only keeping those coloumns that are actually needed.

In [211]:
df_q1.drop(df_q1.columns.difference(['popularity', 'original_title','genres','release_year']), 1, inplace=True)

# 10 b Splitting the various genres using the delimeter 
df_q1.genres = df_q1.genres.str.split('|')

The below function explode that I have used is taken from stackoverflow. Which allows me to seperate the various movie rows according to their multiple genres. So for example if I have a movie which is of action and drama genre. This function would allow me to split the row into 2 rows each having a single genre of 'drama' and 'action'

In [212]:
# 10 c splits joint genres into two seperate genres "i.e., romance/comedy -> romance, comedy"
def explode(df, lst_cols, fill_value=''):
    # make sure `lst_cols` is a list
    if lst_cols and not isinstance(lst_cols, list):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)

    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()

    if (lens > 0).all():
        # ALL lists in cells aren't empty
        return pd.DataFrame({
            col:np.repeat(df[col].values, lens)
            for col in idx_cols
        }).assign(**{col:np.concatenate(df[col].values) for col in lst_cols}) \
          .loc[:, df.columns]
    else:
        # at least one list in cells is empty
        return pd.DataFrame({
            col:np.repeat(df[col].values, lens)
            for col in idx_cols
        }).assign(**{col:np.concatenate(df[col].values) for col in lst_cols}) \
          .append(df.loc[lens==0, idx_cols]).fillna(fill_value) \
          .loc[:, df.columns]
In [213]:
#10 d Dropping null values and calling the explode function
df_q1.dropna(inplace = True);
df_q1 = explode(df_q1, ['genres'])
# What are the most popular genres 

Having extracted prelimenary results, we can now visualize which genre had the most movies over all the years. The below code and the subsequent graph does just that.

In [214]:
popular_genres = df_q1['genres'].value_counts()
popular_genres.plot('bar', figsize=(15,10), sort_columns='True', title='Popular Genres');

While the above graph does do a good job in explaning the overall trend of the genres popularity by summerizing the dataset, it does not show us the trends over the years where we can individually deduce and see which genre was most popular in one year and whether its popularity increased or decreased in the next or previous years. This year by year analysis cannot be done by the above bar chart therefore I decide to use pie charts which would enable us to indivually see genre trends over all the years.

In [216]:
# 11 a We first extract all the years present in the dataframe and sort them. 
years = df_q1['release_year'].unique()
years.sort()
# 11 b We then loop over these subsequent years and extract the movie genre from that particular year and group them by 
# their popularity(sum). We then plot each using a pie chart. 
for i in years:
    genre_popularity = df_q1[df_q1['release_year'] == i]
    genre_popularity = (genre_popularity.groupby('genres')['popularity'].sum()).sort_values(ascending=False)
    genre_popularity.plot(kind='pie', figsize=(7,7));
    plt.title(i)
    plt.show()

Q2 - b) Lets make another observation regarding voter count, the vote average and how it has changed over the years.

In [218]:
# 12 a
df_q3 = df.copy(deep=True)
df_q3 =df_q3.drop(df_q3.columns.difference(['vote_count','vote_average','release_year']), 1)
In [228]:
# 12 b group vote count by release year and find average 
votes =(df_q3.groupby('release_year', sort=True).mean())
# y = df_q3.groupby('genres')['popularity'].sum()
In [229]:
votes.head(4)
Out[229]:
vote_count vote_average
release_year
1960 77.531250 6.325000
1961 77.580645 6.374194
1962 74.750000 6.343750
1963 82.823529 6.329412
In [230]:
# 12 c graph the results for vote count
votes['vote_count'].plot(title='Average Vote count over the years');
In [227]:
# 12 d graph the results for average vote
votes['vote_average'].plot( title='Average vote over the years');

We note that the vote average count has increased steadily over the years however the vote average has only seen a spike in the mid 70s after which there was a decline(albiet with myraid fluctuations).

Conclusions

Limitations:

Data Wrangling

  1. Scope out dataset, "i.e., for null values, identify characteristic columns, potential correlatins, datatype, duplicate values, other discrepencies, etc.
  2. Remove null values
  3. Remove unnecessary columns
  4. Change datastypes accordingly
  5. Remove any duplicate value
  • A dataset of 10,000 uncleaned records is not enough by itself to discover trends within data. Considering the fact that we shall be cleaning the data and removing null values, the size of the data would reduce signifcantly. An example of this in particular is the 'revenue_adj' and the 'budget_adj' coloumns where almost 50% of the coloum values contain zero values. This large number of missing values dissables us to scale the values, or compute the local average and fill these values as they would reduce the reliablity of the dataset.

Hence the biggest limitation of this dataset is the fact that it contains a limited number of values and any conclusion that will be drawm would lack reliablity since the underlaying data is limited.

The second limitation in this IMDB dataset is the fact that there is also noise in the data. While I have been able to in various occasions remove outliers. This was always not possible considering the limited skill set I have.

Conclusions: While I have given my conclusions of the various features, the dataset itself and any correlations on various occassions in the study, I shall summerize my findings below.

To anwser the signifcant question we asked in the beginning of our analyis that

    Introduction

    'What features in the data are correlated to or effect the revenue of the movie?' (Question 1)

    We find (via Exploratory Data Analysis) that:

    • There is a significant correlation between the movie budget and revenue. The data suggests that on avearge, movies with a higher budget tend to perform better than those with a lower budget. However, the signifcant limitation when coming to a reliable conclusion was the lack of data we had. After cleaning the coloumn of zero values we were left with half of the orignial values. That coupled with the fact that there were a significantly higher amount of missing values of the movies present in the early 50 to 70s.

    • Popularity and revenue are also correlated. This seems an almost predictable conclusion considering that a popularity of a movie generally tends to produce a higher revenue. We now have data to support this inference.

    • The movies directed by directors Peter Jackson, Steven Spielberg and Micheal Bay on average have a marked higher revenue than the remaining directors. It is also interesting to note that Peter Jackson though has directed a sizable number of movies, still has directed less than the top 15. This goes to suggest that he appears to be a competent director .

    This however is not enough to come to a definite conclusion that all famous directors(or those who have directed a sizable number of movies) generate the highest revenue. Even though the data does show a relationship between top directors producing more revenues for a movie than other directors.

    • The trends of revenue earned by the movies over the years show a certain spike in the mid 1960s to 1980s , however dropped to around 200 M and remained almost on average the same from the 1980s too the 2000s . This observation is not very credible as there were a lot of </b> missing values in the data from 1960s to 1980s.

    As mentioned numerous times earliers, any inference/observation which also takes in consideration the movie revenue is not very reliable considering the lower number of data.

    Exploring other trends within the data (Question 2)

    The following trends were found in the data:

    • Drama and Comedy are the most popular movie genres in total.
    • Voter count on average has increased significantly over the years though the average vote has seen a gradual decline after the mid 1970s.
    • The top diretors which directed the most movies include Spielberg, Clint Eastwood and Ridley Scott have directed a marked more number of movies than other directors. However Peter Jackson seems to have generated the most revenue which in fact seems plausible since he directed the infamous 'Lord of the Rings' movie trilogy among significant others.

    Reference:
    1)Please note that the 'explode' function used to seperate the movie genres from the genre coloumns was taken from the following stackoverflow question
    https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows
    2) I also drew inspiratin from the following Kaggle analysis. It was interesting to note that the original dataset contains far more features and the limitless possiblities of data wrangling that can be done.
    https://www.kaggle.com/aninda123/imdb-movie-analysis


In [ ]: